Method for ad-hoc parallel processing in a distributed environment

ABSTRACT

An overall processing time to rasterize, at the first device, the electronic document to be rendered is computed. Also, a rendering time to render, at the first device, the electronic document to be rendered is computed. When the overall processing time to rasterize at the first device is greater than the rendering time to render at the first device, the electronic document to be rendered is parsed into a first document and sub-documents. A productivity capacity of each node is determined, the productivity capacity being a measure of the processing power of the node and the communication cost of exchanging information between the first device and the node. A sub-document is rasterized at a node when a productivity capacity of the node reduces the processing time to rasterize the electronic document to be rendered to be less than the computed overall processing time. The rasterized first document and each rasterized sub-document are aggregated to create a rasterized electronic document to be rendered at the first device.

BACKGROUND

Digital multifunction reprographic systems have grown in sophistication and function. In addition, digital multifunction reprographic systems are often used in environments where several identical or similar machines are present, all of which are connected via some sort of high speed/high bandwidth network. Since some of the jobs that may be submitted to such a machine require extensive computation, it is desirable that a machine can distribute large computing tasks to other cooperating machines on its network. Such sharing can expedite the processing of large or complex tasks.

To provide a job sharing capability, how a job should be partitioned between the processing components should be determined. Conventional approaches use fixed partitioning schemes or consider computational capabilities. However, these conventional approaches do not consider the communication properties of the network.

As noted above, a number of office devices may be connected to a network. Conventionally, these devices are largely independent, and the user interacts with a single device. However, there is an opportunity for a collection of such devices to act collaboratively by sharing resources and cooperating to perform a user's task.

An example of such a task might be the ripping of a very large, rich print job. Documents are typically described in a page description language (PDL) format such as PCL, PostScript, or PDF. The PDL provides a series of drawing commands and as part of the printing process these commands must be converted to a raster image. Frequently, it can take longer to do this rasterization than it takes to image the raster pattern on the paper. Thus, the printer may sit in an idle state while the document is being prepared.

To improve the productivity of the printer, a conventional system has been developed that consists of an array of processors and software to separate the print job into pages and to farm the pages out to various processors for parallel rasterization. However, this is a specialized (and expensive) piece of hardware, and would not be appropriate for occasional use in the office environment.

Therefore, it is desirable to provide a system that enables a sharing of resources without relying upon expensive hardware.

Moreover, it is desirable to provide a system where the document is separated into pages that are then farmed out to neighboring multi-functional devices for parallel rasterization.

BRIEF DESCRIPTION OF THE DRAWING

The drawings are only for purposes of illustrating various embodiments and are not to be construed as limiting, wherein:

FIG. 1 illustrates an example of a spanning tree;

FIGS. 2 through 4 illustrate a method of distributing a computing job among a number of communicating computing devices on a network;

FIG. 5 illustrates an example of a one level tree;

FIG. 6 illustrates an example of a grid environment;

FIG. 7 illustrates a comparison of overall execution time;

FIG. 8 illustrates a comparison of speedup and efficiency factors;

FIG. 9 graphically illustrates an example of a calculation of a group capability of a node;

FIG. 10 graphically illustrates a Speedup(n) relationship;

FIG. 11 graphically illustrates an Efficiency(n) relationship;

FIG. 12 graphically illustrates another example of a calculation of a group capability of a node;

FIG. 13 graphically illustrates another Speedup(n) relationship; and

FIG. 14 graphically illustrates another Efficiency(n) relationship.

DETAILED DESCRIPTION

For a general understanding, reference is made to the drawings. In the drawings, like references have been used throughout to designate identical or equivalent elements. It is also noted that the drawings may not have been drawn to scale and that certain regions may have been purposely drawn disproportionately so that the features and concepts could be properly illustrated.

As noted above, it is desirable to provide a system where the document is separated into pages that are then farmed out to neighboring multi-functional devices for parallel rasterization.

The rasterized results could then be compressed and sent back to the printing device for imaging. In this scenario, the system would need to decide which of the available neighboring machines should be used, and how many pages each machine should be given. The ideal partitioning could consider the productivity of the candidate machines, but that productivity depends not only on the processing power of the device, but also upon the communication costs in interacting with the device.

It is noted that rasterization and rendering of a document is only one of a number of possible functions that a multifunction device might perform. For example, a multifunction device might scan a document and apply optical character recognition techniques to discover the text that the page images contain. Here again, the scanned document could be partitioned into pages that could be distributed to candidate machines in order to conduct the optical character recognition processing in parallel. The device handling would include the disposition of the optical character recognition results. Other types of processing such as format conversion or translations are also candidates for parallel processing.

In general, the disposition of the document by the devices is referred to as the actions carried out by the device such as rendering or storing the document. Thus, the time to render or store a document is generally referred to as the disposition time to dispose of the document. Also, the work needed to prepare the document for disposition (such as rasterization or optical character recognition) is generally referred to as the processing carried out on the document. Thus, the time to rasterize a document or to optically recognize the characters in the document is generally referred to as the processing time to process the document.

In other words, candidates for parallel processing are jobs with actions that are not easily distributed. It is noted that one does not usually distribute the rendering of a job because a user typically desires to retrieve a single correlated document at the device. Moreover, it is noted that one does not usually distribute the storing of a document since a user typically desires a single coherent document to be stored.

In the description given below, the system includes a number of computing networks with a variety of computational devices including a number of digital reprographic machines. Each digital reprographic machine may comprise several different components including, for example, printing, scanning, and document storage. Each digital reprographic machine has associated with it a computing resource, usually some sort of microprocessor based computer that includes hardware and software components that can handle the various digital reprographic machine functions.

These functions can include: network communications, conversion of page description language (PDL) descriptions of documents into print format digital images, conversion of scanned image documents into a variety of formats, and others. There are also other devices on the network besides the digital reprographic machines including a variety of computer resources which may include, among other things, both servers and workstations. A user at a workstation may choose to access the resources available on the network, in particular, those associated with the digital reprographic machines.

In the following discussion a process will be described wherein the processing of a printing job, submitted by a workstation in PDL form to one of the digital reprographic machines, can be shared between several of the machines on the network.

A network may include a variety of devices. For example, it might include a computer workstation, a file or computation server, as well as a number of digital reprographic machines. The exact topology of the network is not relevant as long as the devices can communicate with each other. The devices may have different rates of information transmission or communication overhead. In such a system, a large, divisible job (such as a large document that can be processed in pieces) is submitted to node n₀, which decides to split the large document into a number of smaller pieces and sends the smaller pieces to other nodes for parallel processing.

Traditional approaches assume homogeneous nodes and/or negligible communication overhead; however, node and network personalities should be taken into consideration, namely, node computation capabilities and communication overheads. Further, traditional parallel processing approaches assume that the data domain is located at all the nodes and consequently there is no distribution cost. This assumption is not valid for document processing in an office environment, in which a document is typically submitted to only one machine. As a result, distribution cost has to be taken into consideration.

To facilitate the distribution of the large document, node n₀ has to decide which among the n nodes (including itself) are to be selected to parallel process the large document. An example of this determination is as follows.

Node n₀ sends a message to each of the n−1 nodes asking about the nodes' computation capabilities. These messages are sent simultaneously. Messages are associated with a current time stamp, lifetime t_(lifetime), a sender ID list, and an addressee ID list of nodes from the set of n candidate nodes that are not on the sender list. Expired messages will be discarded by receiver nodes.

When a node receives its first query message, it immediately sends a feedback message to the sender and ignores subsequent query messages from other senders. It also forwards the query message to the members of the addressee list, with a new time stamp and with its own ID removed from the addressee list and appended to the sender list. The process continues until the addressee list is empty or it is no longer worth splitting jobs, that is, until the following worth-splitting condition no longer holds:

$\delta < {\frac{{\lambda\mu}_{1}}{{\left( {1 + \alpha} \right)\mu_{0}^{2}\mu_{1}} + {{\lambda\mu}_{0}\left( {\mu_{0} + \mu_{1}} \right)}}.}$

Then, a spanning tree, as illustrated in FIG. 1, is constructed from the links over which the first query message reached each node. This criterion determines the depth of the spanning tree.

The example illustrated in FIG. 1 shows a spanning tree where n=6. Node n₀ begins the process by sending messages to nodes n₁ through n₅. When these addressee nodes receive the message, the addressee nodes forward the message. So when node n₁ receives the message, node n₁ forwards it to nodes n₂ through n₅.

Suppose that in this example, the communication between node n₀ and nodes n₃ through n₅ is slow. Although n₀ sends a message to n₃, n₃ receives the message from n₁ first, and as a result n₃ becomes a direct child of n₁. In this example, FIG. 1 illustrates the same relationship for nodes n₄ and n₅.

When the termination condition is met, the nodes will stop propagating the query messages. Then a spanning tree will be constructed where the route from n_(i) to n_(j) represents the fastest path between the two nodes.
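The flooding procedure described above effectively attaches each node to whichever node delivers a query to it first. A minimal Python sketch of that idea follows. It assumes that each link has a single known one-way delay, and it ignores message lifetimes and the splitting criterion. The `delay` table and the values `FAST`, `MEDIUM`, and `SLOW` are hypothetical, chosen only to mimic the FIG. 1 example in which the links from n₀ to n₃ through n₅ are slow.

```python
import heapq

def first_arrival_tree(delay, root=0):
    """Return {node: parent}, where parent is the sender of the first query to arrive.

    delay[a][b] is the assumed one-way delay of the link from node a to node b.
    """
    n = len(delay)
    parent = {}
    reached = {root}
    # Each heap entry is (arrival_time, receiver, sender).
    heap = [(delay[root][j], j, root) for j in range(n) if j != root]
    heapq.heapify(heap)
    while heap:
        t, node, sender = heapq.heappop(heap)
        if node in reached:
            continue  # a node ignores every query after its first one
        reached.add(node)
        parent[node] = sender
        # Forward the query. Skipping already-reached nodes is only a
        # simulation shortcut: they would ignore the message anyway.
        for j in range(n):
            if j not in reached:
                heapq.heappush(heap, (t + delay[node][j], j, node))
    return parent

# Hypothetical delays (arbitrary time units) for six nodes n0..n5.
FAST, MEDIUM, SLOW = 1.0, 5.0, 10.0
delay = [[FAST] * 6 for _ in range(6)]
for j in (3, 4, 5):
    delay[0][j] = delay[j][0] = SLOW    # n0 <-> n3..n5 is slow
    delay[2][j] = delay[j][2] = MEDIUM  # n2 <-> n3..n5 is slower than n1's links

tree = first_arrival_tree(delay)
print(tree)
```

With these assumed delays, n₁ and n₂ attach to n₀, while n₃ through n₅ attach to n₁, which is the shape described for FIG. 1.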

To calculate the capabilities of each node, each leaf node (a node at the end of a branch of the spanning tree) reports its computation capability and communication overhead to its direct parent node. As shown in FIG. 1, n₃, n₄, and n₅ report to n₁, and n₂ reports to n₀. After an intermediate node (n₁ in this example) collects all the response messages from its children nodes (n₃, n₄, and n₅ in this example), it sequences these nodes based on their reported computation capabilities and decides the number of children nodes using formula F₁ or F₂, as set forth below. This criterion determines the width of the spanning tree.

$\begin{matrix} \left. \begin{matrix} {{{Speedup}(n)} = {\frac{{\mu_{0}\mu_{1}} + {\lambda\mu}_{0} + {\left( {{\alpha\mu}_{0} + \lambda} \right){\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}}{\begin{matrix} {\left( {{{\delta\mu}_{0}^{2}\mu_{1}} + {\lambda\delta\mu}_{0}^{2} + {\lambda\mu}_{0} + {\mu_{0}\mu_{1}}} \right) +} \\ {\left( {{\alpha\mu}_{0} + {\delta\alpha\mu}_{0}^{2} + {\lambda\delta\mu}_{0}} \right){\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}} \end{matrix}} > {threshold}_{s}}} \\ {{{Efficiency}(n)} = {{\frac{1}{n} \times \frac{{\mu_{0}\mu_{1}} + {\lambda\mu}_{0} + {\left( {{\alpha\mu}_{0} + \lambda} \right){\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}}{\begin{matrix} {\left( {{{\delta\mu}_{0}^{2}\mu_{1}} + {\lambda\delta\mu}_{0}^{2} + {\lambda\mu}_{0} + {\mu_{0}\mu_{1}}} \right) +} \\ {\left( {{\alpha\mu}_{0} + {\delta\alpha\mu}_{0}^{2} + {\lambda\delta\mu}_{0}} \right){\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}} \end{matrix}}} > {threshold}_{e}}} \end{matrix} \right\} & F_{1} \\ \left. \begin{matrix} {{{Speedup}(n)} = {\frac{{\left( {1 + \alpha} \right)\mu_{1}} + {\lambda {\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}}{\begin{matrix} {\left\lbrack {{\left( {1 + \alpha} \right){\delta\mu}_{0}\mu_{1}} + \lambda + \mu_{1} - {\alpha\mu}_{0}} \right\rbrack +} \\ {\left( {{\lambda\delta} + \alpha} \right){\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}} \end{matrix}} > {threshold}_{s}}} \\ {{{Efficiency}(n)} = {{\frac{1}{n} \times \frac{{\left( {1 + \alpha} \right)\mu_{1}} + {\lambda {\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}}{\begin{matrix} {\left\lbrack {{\left( {1 + \alpha} \right){\delta\mu}_{0}\mu_{1}} + \lambda + \mu_{1} - {\alpha\mu}_{0}} \right\rbrack +} \\ {\left( {{\lambda\delta} + \alpha} \right){\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}} \end{matrix}}} > {threshold}_{e}}} \end{matrix} \right\} & F_{2} \end{matrix}$

Here, to simplify the formulas, the nodes have been relabeled such that the intermediate node is now node 0 and its direct children nodes have IDs 1, 2, . . . , m. λ_(i) is the communication overhead between node 0 and node i; μ_(i) is the processing speed of node i; and α is the ratio of the document size after processing to the original document size.

Every intermediate node calculates and reports μ_(group) to its direct parent node using formula F₃ or F₄, until n₀ receives all the responses from its direct children nodes (n₁ and n₂ in this example).

$\begin{matrix} {\mu_{g_{n}} = \frac{1}{\left( {\frac{1}{\gamma_{d}} + \frac{\alpha}{\gamma_{c}}} \right) + \frac{\lambda + \mu_{1} + {\alpha {\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}}{{\mu_{0}\mu_{1}} + {\lambda\mu}_{0} + {\left( {{\alpha\mu}_{0} + \lambda} \right){\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}}}} & F_{3} \\ {\mu_{g_{n}} = \frac{1}{\left( {\frac{1}{\gamma_{d}} + \frac{\alpha}{\gamma_{c}}} \right) + \frac{\lambda + \mu_{1} - {\alpha\mu}_{0} + {\alpha {\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}}{{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda {\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}}}} & F_{4} \end{matrix}$

Node n₀ partitions the document based on the computation capabilities and communication overhead of its direct children nodes (n₁ and n₂) and itself. If a child node branches to additional nodes, as does node n₁ in the example, then the computation capability of that child node is actually the group capability of the sub tree rooted at the child node; for example, the capability reported by node n₁ is the cumulative capability of the sub tree with node n₁ as the root. Then node n₀ sends the partitioned sub documents to its direct children nodes in proportion to their capabilities.
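As a rough illustration of dividing a job in proportion to reported capabilities, the following Python sketch splits a page count among node n₀ and its direct children. It is a simplification under stated assumptions: it uses plain proportionality only (not the full formulas that also account for communication overhead), the largest-remainder rounding rule is this sketch's own choice, and the capability values are hypothetical.

```python
def partition_pages(total_pages, capabilities):
    """Split total_pages among nodes in proportion to their capabilities.

    Largest-remainder rounding (an assumption of this sketch) guarantees
    that every page is assigned exactly once.
    """
    total_cap = sum(capabilities)
    exact = [total_pages * c / total_cap for c in capabilities]
    shares = [int(x) for x in exact]
    leftover = total_pages - sum(shares)
    # Give the remaining pages to the nodes with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares

# Hypothetical capabilities: n0 itself, the sub tree rooted at n1, and n2.
shares = partition_pages(100, [3.0, 5.0, 2.0])
print(shares)
```

Here the entry for n₁ stands for the group capability of its whole sub tree, so n₁ would apply the same split again to the pages it receives.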

After an intermediate node receives its sub document, it further partitions the sub document and sends the resulting pieces to its direct children nodes. After all the partitioned sub documents are sent out, the intermediate node can start processing the sub document assigned to itself while its direct children nodes are partitioning and distributing sub documents. Consequently, the partition and distribution is carried out on all the intermediate nodes in parallel.

After the nodes finish processing the sub documents, the nodes send the processed sub documents back to their direct parent node for aggregation. Similar to partitioning and distribution, the aggregation is carried out on all the intermediate nodes in parallel. The processed document is finally aggregated on the original node (n₀ in this example) and returned to the user.

The parallel processing method discussed above can be applied to any application that can be partitioned into independent sub tasks for parallel processing and merged after processing. However, the physical environment in which an application is executed may require the process to be customized.

In one environment (typically a cluster or local area network environment), the communication bandwidth is the same among all the nodes (λ_(i)=λ=100 Mbps). In such an environment, all the nodes will become the direct children nodes of node 0. Consequently, a one level tree can be constructed, as illustrated in FIG. 5.

In Grid environments, the communication bandwidth is not uniform. As shown in FIG. 6, bandwidth within one virtual organization is fast while bandwidth across virtual organizations is slower, and consequently one-level trees are constructed within virtual organizations and these trees are connected together in a tree structure according to the bandwidth across virtual organizations.

To accommodate the two environments, for nodes on the i^(th) level of the tree (where i is the length of the sender list) with i even, the lifetime of query messages is

${t_{lifetime} = \frac{v_{message}}{\lambda}},$

where ν_(message) is the size of the query message and λ is the bandwidth within one virtual organization. Messages sent to nodes in other virtual organizations will then expire and be discarded. In this way, a one level tree is constructed within one virtual organization.

For nodes on the i^(th) level of the tree (where i is the length of the sender list) with i odd, the lifetime of query messages is set to a large value, so that nodes in different virtual organizations can be connected.
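The even/odd lifetime rule can be sketched in a few lines of Python. The function name, parameter names, and the use of infinity as the "large value" are assumptions of this sketch; the source specifies only the ratio of message size to intra-organization bandwidth for even levels and an unspecified large value for odd levels.

```python
import math

def query_lifetime(sender_list_len, message_size_bits, intra_vo_bandwidth_bps,
                   long_lifetime=math.inf):
    """Lifetime (seconds) of a query sent from level i = len(sender list)."""
    if sender_list_len % 2 == 0:
        # Even level: the message survives only links about as fast as the
        # bandwidth inside one virtual organization.
        return message_size_bits / intra_vo_bandwidth_bps
    # Odd level: a large lifetime lets the message cross virtual organizations.
    return long_lifetime

# Hypothetical 1000-bit query on a 100 Mbps local network.
print(query_lifetime(0, 1000, 100e6))
print(query_lifetime(1, 1000, 100e6))
```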

FIG. 7 compares the overall execution time using the method discussed above with a method not considering communication overhead as the number of nodes involved in parallel processing increases. FIG. 8 compares the speedup and efficiency factors of the two methods.

As shown in FIG. 9, node 0 calculates the group capability of nodes 0˜n−1 based on the individual computation capabilities of nodes 0˜n−1 (μ₀˜μ_(n−1)) and the communication overheads (λ₁˜λ_(n−1)). Nodes 1˜n−1 are the children nodes of node 0, and they have similar communication speeds (λ_(i)≈λ) according to the algorithm. Here μ_(i) (1≦i≦n−1) could be the group computation capability of a sub tree rooted at node i.

A document of size ν_(i) is assigned to node i, and the size may change to αν_(i) after being processed. For example, if a color document of size ν_(i) bytes (each page is 8.5 by 11) is rasterized and compressed (at a compression ratio of 1/10), then its size becomes

$8.5 \times 11 \times 600^{2} \times 4 \times \frac{1}{10}{{bytes}.}$

Then

$\alpha = {8.5 \times 11 \times 600^{2} \times 4 \times {\frac{1}{10}/v_{i}}}$

in this example.
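The arithmetic in this example can be checked with a short Python sketch. Reading the factors 600² and 4 as a 600 dot-per-inch resolution and 4 bytes per pixel is an interpretation made here, and the 2,000,000-byte original size is a hypothetical value used only to produce a concrete α.

```python
def rasterized_bytes(width_in=8.5, height_in=11.0, dpi=600,
                     bytes_per_pixel=4, compression_factor=10):
    """Size in bytes of one rasterized, compressed page (assumed reading of the example)."""
    return width_in * height_in * dpi ** 2 * bytes_per_pixel / compression_factor

def size_ratio(original_bytes, processed_bytes):
    """alpha: processed size divided by original size."""
    return processed_bytes / original_bytes

raster = rasterized_bytes()
alpha = size_ratio(2_000_000, raster)  # hypothetical 2,000,000-byte input
print(raster, alpha)
```

Under these assumptions the rasterized size is 13,464,000 bytes, so α is well above 1: the processed document is larger than the description it was produced from.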

The divide-parallelize-conquer process is illustrated in FIG. 9. At time t₀, node 0 receives the document of size ν. At time t₁, node 0 finishes partitioning. The partition speed of node 0 is γ_(d) and as a result t₁=t₀+ν/γ_(d). After partitioning, node 0 starts sending partitioned documents to children nodes 1˜n−1 and simultaneously begins processing the portion of the document assigned to itself.

As illustrated in FIG. 9, the children nodes start to process the sub tasks at times t₁, t₂, . . . , t_(n−1) and send the processed documents back to node 0 at times t₁′, t₂′, . . . , t_(n−1)′. From time t₂′, when the first sub document is received, node 0 starts to merge sub documents; the merge speed of node 0 is γ_(c). At time t_(n)′, node 0 has merged all the sub documents in the original sequence and sends the result back to its direct parent node.

The above description is represented below, as E₁. It is assumed that the time needed for partitioning and merging is proportional to the size of the operated document.

$\begin{matrix} {T_{d} = \frac{\sum\limits_{i = 0}^{n - 1}v_{i}}{\gamma_{d}},\mspace{14mu} T_{c} = \frac{\alpha\sum\limits_{i = 0}^{n - 1}v_{i}}{\gamma_{c}},\mspace{14mu} T_{p_{n}} = \frac{v_{0}}{\mu_{0}},\mspace{14mu} \mu_{g_{n}} = \frac{v}{T_{p_{n}} + T_{d} + T_{c}}} & \; \\ {\frac{1}{\lambda}\sum\limits_{j = 1}^{i}v_{j} + \frac{v_{i}}{\mu_{i}} + \frac{\alpha}{\lambda}\sum\limits_{j = i}^{n - 1}v_{j} = \frac{v_{0}}{\mu_{0}}\mspace{14mu} {where}\mspace{14mu} 1 \leq i \leq n - 1} & E_{1} \end{matrix}$

${{Let}\mspace{14mu} s_{i}} = \frac{\prod\limits_{j = 1}^{i - 1}\; \left( {\lambda + {\alpha\mu}_{j}} \right)}{\prod\limits_{j = 2}^{i}\; \left( {\lambda + \mu_{j}} \right)}$

where 2≦i≦n−1 and s₀=s₁=1.

Then

$v_{i} = {\frac{s_{i}\mu_{i}}{\mu_{1}}v_{1}}$

where 1≦i≦n−1, and

$v_{1} = \frac{{\lambda\mu}_{1}v_{0}}{\mu_{0}\left( {\lambda + \mu_{1} + {\alpha {\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}} \right)}$

$v = {{\sum\limits_{i = 0}^{n - 1}v_{i}} = {\frac{v_{0}}{\mu_{0}} \times {\frac{{\lambda\mu}_{0} + {\mu_{0}\mu_{1}} + {\left( {\lambda + {\alpha\mu}_{0}} \right){\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}}{\lambda + \mu_{1} + {\alpha {\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}}.}}}$

Consequently,

$T_{p_{n}} = {\frac{\lambda + \mu_{1} + {\alpha {\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}}{{\mu_{0}\mu_{1}} + {\lambda\mu}_{0} + {\left( {{\alpha\mu}_{0} + \lambda} \right){\sum\limits_{i = 1}^{n - 1}{s_{i}\mu_{i}}}}} \times v}$

where n≧2, and

$T_{p_{1}} = \frac{v_{0}}{\mu_{0}}.$

Therefore,

$\mu_{g_{n}} = \frac{v}{T_{d} + T_{c} + T_{p_{n}}} = \frac{1}{\left( \frac{1}{\gamma_{d}} + \frac{\alpha}{\gamma_{c}} \right) + \frac{\lambda + \mu_{1} + \alpha\sum\limits_{i = 1}^{n - 1}s_{i}\mu_{i}}{\mu_{0}\mu_{1} + \lambda\mu_{0} + \left( \alpha\mu_{0} + \lambda \right)\sum\limits_{i = 1}^{n - 1}s_{i}\mu_{i}}}$

where n≧2.
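The relations for s_(i), ν_(i), and ν given above can be exercised numerically. The following Python sketch computes the sub-document sizes for a given total size and then checks that every child's send, process, and return time equals ν₀/μ₀, as E₁ requires. The function names and the sample speeds are hypothetical; μ_(i) and λ are assumed to be expressed in the same size-per-time units.

```python
def scale_factors(mu, lam, alpha):
    """s_i for i = 0..n-1, with s_0 = s_1 = 1."""
    s = [1.0] * len(mu)
    for i in range(2, len(mu)):
        # Ratio of consecutive terms: s_i / s_{i-1} = (lam + alpha*mu_{i-1}) / (lam + mu_i)
        s[i] = s[i - 1] * (lam + alpha * mu[i - 1]) / (lam + mu[i])
    return s

def split_sizes(v, mu, lam, alpha):
    """Return [v_0, ..., v_{n-1}] summing to v so that all nodes finish together."""
    n = len(mu)
    if n == 1:
        return [v]
    s = scale_factors(mu, lam, alpha)
    weighted = sum(s[i] * mu[i] for i in range(1, n))
    denom = lam + mu[1] + alpha * weighted
    # Invert v = (v_0/mu_0) * (lam*mu_0 + mu_0*mu_1 + (lam + alpha*mu_0)*weighted) / denom
    v0 = v * mu[0] * denom / (lam * mu[0] + mu[0] * mu[1] + (lam + alpha * mu[0]) * weighted)
    v1 = lam * mu[1] * v0 / (mu[0] * denom)
    return [v0] + [s[i] * mu[i] / mu[1] * v1 for i in range(1, n)]

# Hypothetical speeds for node 0 and three children, sorted by decreasing speed.
mu, lam, alpha = [4.0, 3.0, 2.0, 1.0], 10.0, 1.0
sizes = split_sizes(100.0, mu, lam, alpha)
target = sizes[0] / mu[0]
for i in range(1, len(mu)):
    elapsed = (sum(sizes[1:i + 1]) / lam      # sending pieces 1..i in turn
               + sizes[i] / mu[i]             # processing piece i
               + alpha * sum(sizes[i:]) / lam)  # returning processed pieces i..n-1
    print(i, round(elapsed, 6), round(target, 6))
```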

According to Amdahl's law, the speedup of parallelism is

${{Speedup}(n)} = {\frac{T_{1}}{T_{n}}.\begin{matrix} {{{Speedup}(2)} = \frac{T_{1}}{T_{n}}} \\ {= \frac{v}{T_{p_{2}} + T_{d} + T_{c}}} \\ {= {\frac{{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda \left( {\mu_{0} + \mu_{1}} \right)}}{\begin{matrix} {{\frac{T_{d} + T_{c}}{v} \times \mu_{0} \times \left\lbrack {{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda \left( {\mu_{0} + \mu_{1}} \right)}} \right\rbrack} +} \\ {{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda\mu}_{0}} \end{matrix}}.}} \end{matrix}}$

Only when

${{{\frac{T_{d} + T_{c}}{v} \times \mu_{0} \times \left\lbrack {{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda \left( {\mu_{0} + \mu_{1}} \right)}} \right\rbrack} - {\lambda\mu}_{1}} < 0},$

node 0 splits jobs.

Let

$\delta = {\frac{T_{d} + T_{c}}{v} = {\frac{1}{\gamma_{d}} + \frac{\alpha}{\gamma_{c}}}}$

be the overhead incurred by parallel processing.

When

${\delta < \frac{\lambda \; \mu_{1}}{{\left( {1 + \alpha} \right)\mu_{0}^{2}\mu_{1}} + {\lambda \; {\mu_{0}\left( {\mu_{0} + \mu_{1}} \right)}}}},$

it is worth splitting.
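The splitting test reduces to comparing δ with a bound that depends only on μ₀, μ₁, λ, and α. A small Python sketch of that comparison follows, together with the two-node speedup so the equivalence with Speedup(2)>1 can be observed. All function names and numerical values are hypothetical.

```python
def parallel_overhead(gamma_d, gamma_c, alpha):
    """delta = 1/gamma_d + alpha/gamma_c (partition plus merge cost per unit size)."""
    return 1.0 / gamma_d + alpha / gamma_c

def worth_splitting(delta, mu0, mu1, lam, alpha):
    """True when delta is below the bound for which two nodes beat one."""
    bound = lam * mu1 / ((1 + alpha) * mu0 ** 2 * mu1 + lam * mu0 * (mu0 + mu1))
    return delta < bound

def speedup2(delta, mu0, mu1, lam, alpha):
    """Speedup(2) written in terms of delta = (T_d + T_c) / v."""
    d = (1 + alpha) * mu0 * mu1 + lam * (mu0 + mu1)
    return d / (delta * mu0 * d + (1 + alpha) * mu0 * mu1 + lam * mu0)

# Hypothetical values: equal nodes, link ten times faster than processing.
delta = parallel_overhead(gamma_d=100.0, gamma_c=100.0, alpha=1.0)
print(worth_splitting(delta, 1.0, 1.0, 10.0, 1.0), speedup2(delta, 1.0, 1.0, 10.0, 1.0))
print(worth_splitting(1.0, 1.0, 1.0, 10.0, 1.0), speedup2(1.0, 1.0, 1.0, 10.0, 1.0))
```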

When node 0 splits, it should decide how many nodes (n) are involved.

${\Delta \; T} = {{T_{n} - T_{n + 1}} = {{T_{p_{n}} - T_{p_{n + 1}}} = {\frac{s_{n}\mu_{n}{\lambda \left( {\lambda + \mu_{1}} \right)}}{\begin{matrix} {\left\lbrack {{\mu_{0}\mu_{1}} + {\lambda \; \mu_{0}} + {\left( {{\alpha \; \mu_{0}} + \lambda} \right){\sum\limits_{i = 1}^{n}\; {s_{i}\mu_{i}}}}} \right\rbrack \times} \\ \left\lbrack {{\mu_{0}\mu_{1}} + {\lambda \; \mu_{0}} + {\left( {{\alpha \; u_{0}} + \lambda} \right){\sum\limits_{i = 1}^{n - 1}\; {s_{i}\mu_{i}}}}} \right\rbrack \end{matrix}} > 0}}}$

Executing a job on n+1 nodes will take less execution time than that on n nodes, where n≧2.

The speedup parameter is denoted by

${{Speedup}(n)} = {\frac{T_{1}}{T_{d} + T_{c} + T_{p_{n}}} = \frac{{\mu_{0}\mu_{1}} + {\lambda\mu}_{0} + {\left( {{\alpha\mu}_{0} + \lambda} \right){\sum\limits_{i = 1}^{n - 1}\; {s_{i}\mu_{i}}}}}{\begin{matrix} {\left( {{{\delta\mu}_{0}^{2}\mu_{1}} + {\lambda \; {\delta\mu}_{0}^{2}} + {\lambda\mu}_{0} + {\mu_{0}\mu_{1}}} \right) +} \\ {\left( {{\alpha\mu}_{0} + {{\delta\alpha}\; \mu_{0}^{2}} + {\lambda\delta\mu}_{0}} \right){\sum\limits_{i = 1}^{n - 1}\; {s_{i}\mu_{i}}}} \end{matrix}}}$

Intuitively, as node 0 splits the job into more pieces, the overall execution time decreases more slowly and eventually shows no significant change. This effect is evaluated by

${{Efficiency}(n)} = {{\frac{{Speedup}(n)}{n}.{{Efficiency}(n)}} = {{\frac{1}{n} \times \frac{{\mu_{0}\mu_{1}} + {\lambda\mu}_{0} + {\left( {{\alpha\mu}_{0} + \lambda} \right){\sum\limits_{i = 1}^{n - 1}\; {s_{i}\mu_{i}}}}}{\begin{matrix} {\left( {{{\delta\mu}_{0}^{2}\mu_{1}} + {\lambda \; {\delta\mu}_{0}^{2}} + {\lambda\mu}_{0} + {\mu_{0}\mu_{1}}} \right) +} \\ {\left( {{\alpha\mu}_{0} + {{\delta\alpha}\; \mu_{0}^{2}} + {\lambda\delta\mu}_{0}} \right){\sum\limits_{i = 1}^{n - 1}\; {s_{i}\mu_{i}}}} \end{matrix}}\mspace{14mu} {where}\mspace{14mu} n} \geq 2.}}$

According to the algorithm, node 0 adds node i to its children nodes in order of decreasing μ_(i). As a result, the overall execution time saved by adding one more node becomes less significant. In other words, Efficiency(n) decreases as n increases.
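One way to read the threshold conditions is as a stopping rule: children are added in order of decreasing speed for as long as both Speedup(n) and Efficiency(n) stay above their thresholds. The Python sketch below implements that reading for the first Speedup(n) formula. The stopping rule as coded, the function names, and the threshold values are assumptions made for illustration.

```python
def speedup(mu, lam, alpha, delta):
    """Speedup(n) for node 0 (mu[0]) plus children mu[1:], with n = len(mu) >= 2."""
    mu0, mu1 = mu[0], mu[1]
    s, w = 1.0, mu1  # running s_i and running sum of s_i * mu_i for i >= 1
    for i in range(2, len(mu)):
        s *= (lam + alpha * mu[i - 1]) / (lam + mu[i])
        w += s * mu[i]
    num = mu0 * mu1 + lam * mu0 + (alpha * mu0 + lam) * w
    den = (delta * mu0 ** 2 * mu1 + lam * delta * mu0 ** 2 + lam * mu0 + mu0 * mu1
           + (alpha * mu0 + delta * alpha * mu0 ** 2 + lam * delta * mu0) * w)
    return num / den

def choose_children(mu0, child_mu, lam, alpha, delta, th_s, th_e):
    """Add children, fastest first, while both thresholds are still exceeded."""
    chosen = []
    for m in sorted(child_mu, reverse=True):
        trial = [mu0] + chosen + [m]
        sp = speedup(trial, lam, alpha, delta)
        if sp <= th_s or sp / len(trial) <= th_e:
            break
        chosen.append(m)
    return chosen

# Hypothetical case: eight identical candidates, link ten times the node speed.
chosen = choose_children(1.0, [1.0] * 8, lam=10.0, alpha=1.0, delta=0.0,
                         th_s=1.0, th_e=0.7)
print(len(chosen))
```

In this hypothetical case four children are kept: a fifth would still shorten the run, but it would push the efficiency below the assumed 0.7 threshold.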

The relationships of Speedup(n) and Efficiency(n) are plotted in FIGS. 10 and 11, respectively. For simplicity, assume α=1 (document size does not change after processing), μ_(i)=μ (all the involved nodes have the same computational capability), and δ=0 (the time for dividing and conquering is negligible). Then

${{Speedup}(n)} = {{\frac{{n\frac{\lambda}{\mu}} + n}{\frac{\lambda}{\mu} + n}\mspace{14mu} {and}\mspace{14mu} {{Efficiency}(n)}} = {\frac{\frac{\lambda}{\mu} + 1}{\frac{\lambda}{\mu} + n}.}}$

The scalability of a parallel processing method is the ability to maintain parallel processing gain when both problem size and system size increase. The scalability is defined as

${{{Scalability}\left( {m,n} \right)} = {\frac{n \times w_{m}}{m \times w_{n}} = \frac{T_{m}\left( w_{m} \right)}{T_{n}\left( w_{n} \right)}}},$

where w_(n) is the work executed when n processors are employed and T_(n) is the corresponding execution time, and w_(m) is the work executed when m processors are employed and T_(m) is the corresponding execution time. Ideally, if the overall workload and the number of processors involved both scale up m times, the execution time for the scaled workload remains the same as that for the original workload, namely, Scalability(m,n)=1.

Assuming μ_(i)=μ, the termination conditions are

${{Speedup}(n)} = {\frac{\mu^{2} + {\lambda\mu} + {\left( {{\alpha\mu}^{2} + {\lambda\mu}} \right){\sum\limits_{i = 1}^{n - 1}\; s_{i}}}}{\begin{matrix} {\left( {{\delta \; \mu^{3}} + {\lambda\delta\mu}^{2} + {\lambda\mu} + \mu^{2}} \right) +} \\ {\left( {{\alpha\mu}^{2} + {\delta\alpha\mu}^{3} + {\lambda\delta\mu}^{2}} \right){\sum\limits_{i = 1}^{n - 1}\; s_{i}}} \end{matrix}} > {th}_{s}}$ and ${{Efficiency}\; (n)} = {{\frac{1}{n} \times \frac{\mu^{2} + {\lambda\mu} + {\left( {{\alpha\mu}^{2} + {\lambda\mu}} \right){\sum\limits_{i = 1}^{n - 1}\; s_{i}}}}{\begin{matrix} {\left( {{\delta \; \mu^{3}} + {\lambda\delta\mu}^{2} + {\lambda\mu} + \mu^{2}} \right) +} \\ {\left( {{\alpha\mu}^{2} + {\delta\alpha\mu}^{3} + {\lambda\delta\mu}^{2}} \right){\sum\limits_{i = 1}^{n - 1}\; s_{i}}} \end{matrix}}} > {{th}_{e}.}}$

As a result, the number of processors n is determined independent of workload w_(n). Therefore,

${{Scalability}\; \left( {m,n} \right)} = {\frac{w_{m}}{w} \neq {1\mspace{14mu} {where}\mspace{14mu} w_{m}} \neq {w_{n}.}}$

In another example, corresponding to formulas F₂ and F₄, the timing relationships are as follows (E₁):

$T_{d} = \frac{\sum\limits_{i = 0}^{n - 1}v_{i}}{\gamma_{d}},\mspace{14mu} T_{c} = \frac{\alpha\sum\limits_{i = 0}^{n - 1}v_{i}}{\gamma_{c}},\mspace{14mu} T_{p_{n}} = \frac{v_{0}}{\mu_{0}} + \frac{\alpha}{\lambda}\sum\limits_{i = 2}^{n - 1}v_{i}$

$\mu_{g_{n}} = \frac{v}{T_{p_{n}} + T_{d} + T_{c}} = \frac{\sum\limits_{i = 0}^{n - 1}v_{i}}{T_{p_{n}} + T_{d} + T_{c}}$

$\frac{v_{1}}{\lambda}\left( 1 + \alpha \right) + \frac{v_{1}}{\mu_{1}} = \frac{v_{0}}{\mu_{0}}$

$\left( \frac{1}{\mu_{i}} + \frac{1}{\lambda} \right) \times v_{i} = \left( \frac{1}{\mu_{i - 1}} + \frac{\alpha}{\lambda} \right) \times v_{i - 1}$

where 2≦i≦n−1.

Let

$s_{i} = \frac{\prod\limits_{j = 1}^{i - 1}\; \left( {\lambda + {\alpha\mu}_{j}} \right)}{\prod\limits_{j = 2}^{i}\; \left( {\lambda + \mu_{j}} \right)}$

where 2≦i≦n−1, and s₀=s₁=1.

Then

$v_{i} = {\frac{s_{i}\mu_{i}}{\mu_{1}}v_{1}}$

where 1≦i≦n−1,

$v = {{\sum\limits_{i = 0}^{n - 1}\; v_{i}} = {\frac{v_{0}}{\mu_{0}} \times {\frac{{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda {\sum\limits_{i = 0}^{n - 1}\; \left( {s_{i}\mu_{i}} \right)}}}{\lambda + {\left( {1 + \alpha} \right)\mu_{1}}}.}}}$

Consequently,

$T_{p_{n}} = {\frac{\lambda + \mu_{1} - {\alpha\mu}_{0} + {\alpha {\sum\limits_{i = 0}^{n - 1}\; {s_{i}\mu_{i}}}}}{{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda {\sum\limits_{i = 0}^{n - 1}\; {s_{i}\mu_{i}}}}} \times v}$

where n≧2, and

$T_{p_{1}} = \frac{v}{\mu_{0}}.$

Therefore,

$\mu_{g_{n}} = \frac{v}{T_{d} + T_{c} + T_{p_{n}}} = \frac{1}{\left( \frac{1}{\gamma_{d}} + \frac{\alpha}{\gamma_{c}} \right) + \frac{\lambda + \mu_{1} - \alpha\mu_{0} + \alpha\sum\limits_{i = 0}^{n - 1}s_{i}\mu_{i}}{\left( 1 + \alpha \right)\mu_{0}\mu_{1} + \lambda\sum\limits_{i = 0}^{n - 1}s_{i}\mu_{i}}}.$

According to Amdahl's law, the speedup of parallelism is

${{Speedup}(n)} = {\frac{T_{1}}{T_{n}}.}$

${{Speedup}(2)} = {\frac{T_{1}}{T_{2}} = {\frac{T_{1}}{T_{p_{2}} + T_{d} + T_{c}} = \frac{{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda \left( {\mu_{0} + \mu_{1}} \right)}}{{\frac{T_{d} + T_{c}}{v} \times \mu_{0} \times \left\lbrack {{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda \left( {\mu_{0} + \mu_{1}} \right)}} \right\rbrack} + {\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda\mu}_{0}}}}$

Only when

${{{\frac{T_{d} + T_{c}}{v} \times \mu_{0} \times \left\lbrack {{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda \left( {\mu_{0} + \mu_{1}} \right)}} \right\rbrack} - {\lambda\mu}_{1}} < 0},$

node 0 splits jobs. Let

$\delta = {\frac{T_{d} + T_{c}}{v} = {\frac{1}{\gamma_{d}} + \frac{\alpha}{\gamma_{c}}}}$

be the overhead incurred by parallel processing. When

${\delta < {\frac{{\lambda\mu}_{1}}{\mu_{0}} \times \frac{1}{{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda \left( {\mu_{0} + \mu_{1}} \right)}}}},$

it is worth splitting.
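The two-node split test above can be checked numerically. The following Python sketch is illustrative only: the function names (`overhead_delta`, `split_threshold`, `should_split`) are hypothetical and not part of the described method, and the parameters simply mirror the symbols μ_0, μ_1, λ, α, γ_d and γ_c as they appear in the expressions above.

```python
def overhead_delta(gamma_d: float, gamma_c: float, alpha: float) -> float:
    """Overhead per unit of job volume: delta = 1/gamma_d + alpha/gamma_c."""
    return 1.0 / gamma_d + alpha / gamma_c


def split_threshold(mu0: float, mu1: float, lam: float, alpha: float) -> float:
    """Largest delta for which Speedup(2) is still greater than 1."""
    denom = (1.0 + alpha) * mu0 * mu1 + lam * (mu0 + mu1)
    return (lam * mu1 / mu0) / denom


def should_split(mu0: float, mu1: float, lam: float, alpha: float,
                 gamma_d: float, gamma_c: float) -> bool:
    """True when the overhead is small enough that splitting onto two nodes pays off."""
    return overhead_delta(gamma_d, gamma_c, alpha) < split_threshold(mu0, mu1, lam, alpha)
```

For example, with μ_0=μ_1=1, λ=2 and α=1 the threshold evaluates to 1/3, so a node whose overhead δ is 0.2 would split while one whose overhead is 0.5 would not.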

When node 0 splits, it should decide how many nodes (n) are involved.

${T_{n} - T_{n + 1}} = {{T_{p_{n}} - T_{p_{n + 1}}} = {\frac{\lambda^{2} + {\left( {\mu_{1} - {\alpha\mu}_{0}} \right)\lambda} - {{\alpha \left( {1 + \alpha} \right)}\mu_{0}\mu_{1}}}{\left\lbrack {{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda {\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}} \right\rbrack \times \left\lbrack {{\left( {1 + \alpha} \right)\mu_{0}\mu_{1}} + {\lambda {\sum\limits_{i = 0}^{n}{s_{i}\mu_{i}}}}} \right\rbrack} \times s_{n}\mu_{n}v}}$

If

${\lambda > \frac{{\alpha\mu}_{0} - \mu_{1} + \sqrt{\left( {{\alpha\mu}_{0} + \mu_{1}} \right)^{2} + {4\alpha^{2}\mu_{0}\mu_{1}}}}{2}},$

executing a job on n+1 nodes will take less execution time than that on n nodes, where n≧2.
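The bound on λ is the positive root of the quadratic in the numerator of T_n − T_(n+1). A minimal Python sketch of that test follows; the function names `min_link_rate` and `adding_node_helps` are hypothetical and chosen only for illustration.

```python
import math


def min_link_rate(mu0: float, mu1: float, alpha: float) -> float:
    """Positive root of lam^2 + (mu1 - alpha*mu0)*lam - alpha*(1 + alpha)*mu0*mu1 = 0."""
    disc = (alpha * mu0 + mu1) ** 2 + 4.0 * alpha ** 2 * mu0 * mu1
    return (alpha * mu0 - mu1 + math.sqrt(disc)) / 2.0


def adding_node_helps(lam: float, mu0: float, mu1: float, alpha: float) -> bool:
    """True when running on n+1 nodes is faster than on n nodes (n >= 2)."""
    return lam > min_link_rate(mu0, mu1, alpha)
```

A node could evaluate `adding_node_helps` once, before recruiting further children, because the bound does not depend on n.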

${{Speedup}(n)} = {\frac{T_{1}}{T_{d} + T_{c} + T_{p_{n}}} = {\frac{{\left( {1 + \alpha} \right)\mu_{1}} + {\lambda {\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}}{\left\lbrack {{\left( {1 + \alpha} \right){\delta\mu}_{0}\mu_{1}} + \lambda + \mu_{1} - {\alpha\mu}_{0}} \right\rbrack + {\left( {{\lambda\delta} + \alpha} \right){\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}}.}}$

Intuitively, as node 0 splits the job into more pieces, the rate at which the overall execution time decreases slows down and eventually shows no significant change. This effect is evaluated by

${{Efficiency}(n)} = {\frac{{Speedup}(n)}{n}.}$

${{Efficiency}(n)} = {\frac{\frac{\left( {1 + \alpha} \right)\mu_{1}}{n} + \frac{\lambda {\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}{n\; \mu_{0}}}{\left\lbrack {{\left( {1 + \alpha} \right){\delta\mu}_{0}\mu_{1}} + \lambda + \mu_{1} - {\alpha\mu}_{0}} \right\rbrack + {\left( {{\lambda\delta} + \alpha} \right){\sum\limits_{i = 0}^{n - 1}{s_{i}\mu_{i}}}}}.}$

According to the algorithm, node 0 adds node i to its children nodes in order of decreasing μ_(i). As a result, the overall execution time saved by adding one more node becomes less significant. In other words, Efficiency(n) decreases as n increases.
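Speedup(n) and Efficiency(n) can be evaluated directly from T_1 = v/μ_0, the overhead δ and the expression for T_(p_n). The Python sketch below is a hedged illustration: the function names are hypothetical, the lists `s` and `mu` hold the per-node values s_i and μ_i with index 0 standing for node 0, and at least two nodes are assumed.

```python
from typing import Sequence


def parallel_time_per_unit(s: Sequence[float], mu: Sequence[float],
                           lam: float, alpha: float) -> float:
    """T_{p_n} / v for n >= 2 nodes."""
    weighted = sum(si * mi for si, mi in zip(s, mu))
    num = lam + mu[1] - alpha * mu[0] + alpha * weighted
    den = (1.0 + alpha) * mu[0] * mu[1] + lam * weighted
    return num / den


def speedup(s: Sequence[float], mu: Sequence[float],
            lam: float, alpha: float, delta: float) -> float:
    """T_1 / (T_d + T_c + T_{p_n}), with every time expressed per unit of v."""
    t1 = 1.0 / mu[0]
    tn = delta + parallel_time_per_unit(s, mu, lam, alpha)
    return t1 / tn


def efficiency(s: Sequence[float], mu: Sequence[float],
               lam: float, alpha: float, delta: float) -> float:
    """Speedup(n) / n."""
    return speedup(s, mu, lam, alpha, delta) / len(mu)
```

Calling `speedup` for successively longer lists, sorted by decreasing μ_i, shows the diminishing return described above.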

The behavior of Speedup(n) and Efficiency(n) is plotted in FIGS. 13 and 14, respectively. For simplicity, assume α=1 (document size does not change after processing), μ_(i)=μ (all the involved nodes have the same computational capability), and δ=0 (the time for dividing and conquering is negligible).

According to

${\lambda > \frac{{\alpha\mu}_{0} - \mu_{1} + \sqrt{\left( {\mu_{1} + {\alpha\mu}_{0}} \right)^{2} + {4\alpha^{2}\mu_{0}\mu_{1}}}}{2}},$

λ>√2·μ so that node 0 will split a job.
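As a worked check of this substitution, setting α=1 and μ_0=μ_1=μ in the inequality above gives

$\lambda > \frac{\mu - \mu + \sqrt{\left( {\mu + \mu} \right)^{2} + {4\mu^{2}}}}{2} = \frac{\sqrt{8\mu^{2}}}{2} = {\sqrt{2}\,\mu}.$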

${{Speedup}(n)} = {{\frac{{2\mu} + {n\; \lambda}}{\lambda + {n\; \mu}}\mspace{14mu} {and}\mspace{14mu} {{Efficiency}(n)}} = {\frac{{2\mu} + {n\; \lambda}}{{n\; \lambda} + {n^{2}\; \mu}} < 1.}}$

As shown in FIG. 13, Speedup(n) increases as n increases; that is, increasing the number of nodes (processors) used for parallel processing reduces the overall execution time. Further, Speedup(n) increases faster with a larger λ/μ, which means that in an environment with higher communication speed, using more nodes for parallel processing can significantly decrease the overall execution time. However, Speedup(n) is bounded above by λ/μ, the value it approaches as n grows.

As shown in FIG. 14, Efficiency(n) decreases as n increases. It is noted that the upper bound of Efficiency(n) is 1.
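The trends attributed to FIGS. 13 and 14 can be reproduced from the simplified closed forms (α=1, μ_(i)=μ, δ=0). The following Python sketch uses hypothetical function names and arbitrary example values of λ and μ chosen so that λ>√2·μ.

```python
def speedup_simple(n: int, lam: float, mu: float) -> float:
    """Speedup(n) = (2*mu + n*lam) / (lam + n*mu) for the simplified case."""
    return (2.0 * mu + n * lam) / (lam + n * mu)


def efficiency_simple(n: int, lam: float, mu: float) -> float:
    """Efficiency(n) = Speedup(n) / n."""
    return speedup_simple(n, lam, mu) / n


if __name__ == "__main__":
    lam, mu = 4.0, 1.0  # example values only
    for n in (2, 4, 8, 16, 64):
        print(n, round(speedup_simple(n, lam, mu), 3),
              round(efficiency_simple(n, lam, mu), 3))
```

With these example values the printed speedup rises toward, but never reaches, λ/μ, while the efficiency falls steadily.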

Assuming μ_(i)=μ, the termination conditions are

${{Speedup}(n)} = {\frac{{\left( {1 + \alpha} \right)\mu} + {\lambda {\sum\limits_{i = 0}^{n - 1}s_{i}}}}{\left\lbrack {{\left( {1 + \alpha} \right){\delta\mu}^{2}} + \lambda + \mu - {\alpha\mu}} \right\rbrack + {\left( {{\lambda\delta\mu} + {\alpha\mu}} \right){\sum\limits_{i = 0}^{n - 1}s_{i}}}} > {{th}_{s}\mspace{14mu} {and}}}$ ${{Efficiency}(n)} = {\frac{\frac{\left( {1 + \alpha} \right)\mu}{n} + \frac{\lambda {\sum\limits_{i = 0}^{n - 1}s_{i}}}{n}}{\left\lbrack {{\left( {1 + \alpha} \right){\delta\mu}^{2}} + \lambda + \mu - {\alpha\mu}} \right\rbrack + {\left( {{\lambda\delta\mu} + {\alpha\mu}} \right){\sum\limits_{i = 0}^{n - 1}s_{i}}}} > {{th}_{e}.}}$

As a result, the number of processors n is determined independently of the workload w_(n). Therefore,

${{Scalability}\left( {m,n} \right)} = {\frac{w_{m}}{w_{n}} \neq 1}$

where w_(m)≠w_(n).

FIGS. 2 through 4 show a flow chart of how a digital reprographic machine on a network can set up an ad hoc distributed processing network to share the processing of the print job. The process consists of four phases which overlap in time.

The first phase begins in step S202 of FIG. 2 when a node, henceforth referred to as node 0, receives a job to be processed. Node 0 determines in step S204 that it would be desirable to split the job and engage extra resources on the network. This decision may be based on criteria such as the size of the job, the current workload of the node, the priority or urgency of the job, or other factors. For sufficiently large jobs, the overhead of establishing connections to other resources and partitioning the job may be lower than the time gained by partitioning and sharing the processing.

In step S206, node 0 sends a broadcast message to other devices on the network. The broadcast message has the ID of the sender node, a time stamp, and an expiration time. In step S208, each node that receives the message responds to the sender, and at the same time stops responding to any other requests.

However, since there may be some nodes whose communication links to node 0 are relatively slow, each node that receives the message resends the message in step S210, substituting its own sender ID. Steps S206 through S210 are repeated recursively by all nodes on the network until the expiration time of the message. Alternatively, a node may choose to not resend the message if its estimate of the overhead of partitioning is too high.
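The broadcast message and its re-sending in steps S206 through S210 can be sketched as a small data structure. In the Python sketch below the class name `QueryMessage`, the field names, and the `forward` helper are hypothetical; whether a re-sent message keeps the original time stamp is also an assumption, since the text specifies only that the sender ID is substituted.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class QueryMessage:
    sender_id: str     # ID of the node sending (or re-sending) the message
    timestamp: float   # time stamp of the message
    expires_at: float  # expiration time after which the message is not re-sent


def forward(msg: QueryMessage, own_id: str, now: float) -> Optional[QueryMessage]:
    """Return the message to re-send with own_id as sender, or None once expired."""
    if now >= msg.expires_at:
        return None
    # Assumption: only the sender ID changes; the other fields are carried over.
    return replace(msg, sender_id=own_id)
```

A node that judges its partitioning overhead too high would simply not call `forward`.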

The second phase of the process, whose start overlaps the first phase, begins in step S212 of FIG. 3. Each node computes its communication overhead with its parent node and also a measure of its effective processing capability. For nodes that have children, the effective computing capacity is a composite measure of the combined processing capacity of the node and all of its child nodes, where the composite measure includes the effects of both the communication overhead and the overhead associated with partitioning any received work. After computing these factors, the node sends this information to its parent node. The processing in step S212 continues recursively as shown in step S214 until all children have responded, and node 0 has the results from all of its direct child nodes.

In the third phase, node 0 now partitions the job into several subtasks and distributes the subtasks to its child nodes in step S216 of FIG. 3. The subtasks are not necessarily equal in size, but may be weighted to provide larger work loads to those child nodes that report larger effective processing capacity. Each node that receives a subtask now either processes it directly or further partitions it and distributes the partitioned pieces to its child nodes in step S218.
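The weighting of subtasks by reported capacity can be illustrated with a proportional split. The Python sketch below assumes, purely for illustration, that a job is measured in whole pages and that rounding leftovers go to the nodes with the largest fractional shares; the function name `partition_pages` is hypothetical.

```python
from typing import List, Sequence


def partition_pages(total_pages: int, capacities: Sequence[float]) -> List[int]:
    """Split total_pages among nodes in proportion to their effective capacities."""
    total_cap = sum(capacities)
    exact = [total_pages * c / total_cap for c in capacities]
    shares = [int(x) for x in exact]  # round down first
    leftover = total_pages - sum(shares)
    # Hand the remaining pages to the nodes with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares
```

The first entry of `capacities` can stand for the partitioning node itself, so that it keeps a share of the work alongside its child nodes.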

In the fourth phase, after each node finishes processing its assigned part of the work, it sends the results to its parent node in step S220 of FIG. 4. The parent node reassembles the individual subtasks from its children until it has received all of them and then returns the assembled result to its parent in turn in step S222. When all subtasks have been reassembled at node 0, the job is complete and the result is returned to the originator.
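The recursive distribute-and-reassemble pattern of the third and fourth phases can be sketched as a tree traversal. The Python sketch below runs sequentially in one process, so it shows only the data flow and not real parallelism. The `Node` class, the `run` function, and the use of a summed subtree capacity as the weight are hypothetical simplifications and are not the composite measure described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    capacity: float  # hypothetical stand-in for a node's own processing capability
    children: List["Node"] = field(default_factory=list)

    def subtree_capacity(self) -> float:
        """Simplified weight: own capacity plus that of all descendants."""
        return self.capacity + sum(c.subtree_capacity() for c in self.children)


def run(node: Node, items: list, work: Callable) -> list:
    """Process items on node and its subtree; results come back in input order."""
    if not node.children:
        return [work(x) for x in items]
    parts = [node] + node.children
    weights = [node.capacity] + [c.subtree_capacity() for c in node.children]
    total = sum(weights)
    results, start, acc = [], 0, 0.0
    for idx, (part, w) in enumerate(zip(parts, weights)):
        acc += w
        last = idx == len(parts) - 1
        end = len(items) if last else round(len(items) * acc / total)
        chunk = items[start:end]
        if part is node:
            results.extend(work(x) for x in chunk)  # the node's own share
        else:
            results.extend(run(part, chunk, work))  # child returns its assembled result
        start = end
    return results
```

Because each parent concatenates the contiguous chunks in the order it handed them out, the result assembled at node 0 preserves the original order of the job.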

Every node knows the composite capacities of its direct child nodes and can estimate their processing times. If a node fails and stops processing, its direct parent node can discover the node failure when no result comes back from the node after the estimated completion time. Since the parent node learned all the child nodes of this failed node during phase 2, it can collect the results from these children and follow the steps in phase 1 to assign the unfinished task (which was originally assigned to the failed node) to other node(s). Node failure delays completion of the affected subtask, but it does not have a significant impact on the overall finishing time, since recovery happens while other nodes process their assigned tasks or consolidate partial results.
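The timeout-based failure detection described above can be sketched as follows. The Python code is a hedged illustration: the function names and the `slack` safety factor are hypothetical, and the estimate simply divides the assigned volume by the child's reported composite capacity.

```python
def estimated_finish(assigned_at: float, volume: float, capacity: float) -> float:
    """Estimated completion time of a subtask given the child's composite capacity."""
    return assigned_at + volume / capacity


def is_presumed_failed(assigned_at: float, volume: float, capacity: float,
                       now: float, slack: float = 1.5) -> bool:
    """True when no result has arrived by the estimate stretched by slack (assumed margin)."""
    deadline = assigned_at + slack * (volume / capacity)
    return now > deadline
```

A parent would call `is_presumed_failed` only for children whose results are still outstanding, and then reassign the unfinished subtask.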

While a parent node is partitioning and assigning tasks to its child nodes, the child nodes which have already received their tasks can immediately start processing. After the parent node finishes assigning tasks, it can start processing its own task. The parent node can continue processing as its child nodes finish their tasks and return results. In this way, communication time is overlapped with computation time.

The process of FIGS. 2 through 4 is dynamic in the sense that it does not require a fixed network description but is capable of utilizing any resources that are available at the time a job is processed. If a node is busy, the node may choose to not respond to any query message or may respond with an indication of reduced capability. In a similar fashion, the method of FIGS. 2 through 4 is efficient in that it takes into account the communication overhead between nodes as well as the computing capability of each node. This allows one to maximize the overall throughput of the combined computing resources.

A further feature of the method illustrated in FIGS. 2 through 4 is that not all of the nodes have to be identical; the method will work in a heterogeneous environment. For example, not all the nodes for the example of distributed PDL printing have to be digital reprographic machines, but can be any computing resource on the network. Such resources might include workstations or servers that are equipped with the appropriate software to allow the node to perform the computing task being distributed.

The decision of how far to partition any particular job may depend on several factors, including the speed with which any node can partition a job into subtasks and the cost, in terms of combined communication overhead and processing speed, of any child node. In general, if the time saved by adding one more subtask and child node is greater than the time needed to do the partitioning and allocation, there is a net gain. It is possible, by analyzing the combined effect of a network of nodes, to compute the effective speedup. Thus the decision as to how far to partition any particular job becomes one of simple computation rather than estimation or guesswork.

It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also, various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.

1. A method of parallel processing, an electronic document to undergo disposition on a first device, in a distributed environment, the distributed environment including the first device and a plurality of distributed processors, each distributed processor being located at a node in the distributed environment, comprising: (a) computing an overall processing time to process, at the first device, the electronic document to undergo disposition; (b) computing a disposition time to dispose of, at the first device, the electronic document to undergo disposition; (c) parsing the electronic document to undergo disposition into a first document and sub-documents when the overall processing time to process at the first device is greater than the disposition time at the first device; (d) determining, at the first device, a productivity capacity of each node, the productivity capacity being a measure of the processing power of the node and the communication cost of exchanging information between the first device and the node; (e) processing a sub-document at a node, the node processing the sub-document when a productivity capacity of the node reduces the processing time to process the electronic document to be disposed to be less than the computed overall processing time; (f) aggregating a processed first document and each processed sub-document to create a processed electronic document to be disposed; and (g) disposing of the processed electronic document to undergo disposition on the first device.
 2. The method as claimed in claim 1, wherein the undergoing disposition is the rendering of the document.
 3. The method as claimed in claim 1, wherein the processing of the document is the rasterization of the document.
 4. The method as claimed in claim 1, wherein the undergoing disposition is the storage of the document.
 5. The method as claimed in claim 1, wherein the processing of the document is the optical character recognition of the document.
 6. The method as claimed in claim 1, wherein a size of a sub-document is proportional to the processing power of the node which will process the sub-document.
 7. The method as claimed in claim 1, wherein the processing power of a node is the combined processing power of the node and direct children nodes connected thereto.
 8. The method as claimed in claim 7, further comprising: (h) parsing the sub-document, at the node having direct children nodes connected thereto, into a first sub-document and partial sub-documents.
 9. The method as claimed in claim 8, wherein a size of a partial sub-document is proportional to the processing power of the children node which will process the partial sub-document.
 10. The method as claimed in claim 8, further comprising: (i) aggregating a processed first sub-document and each processed partial sub-document at the node having direct children nodes connected thereto, to create a processed sub-document.
 11. The method as claimed in claim 8, wherein the node having direct children nodes connected thereto determines a productivity capacity of each child node, the productivity capacity being a measure of the processing power of the child node and the communication cost of exchanging information between the node having direct children nodes connected thereto and the child node, the communication cost being based upon a bandwidth between the node having direct children nodes connected thereto and a child node.
 12. The method as claimed in claim 1, wherein the communication cost is based upon a bandwidth between the first node and a node.
 13. A method of parallel processing, an electronic document to undergo disposition at a first node, in a distributed environment, the distributed environment including the first node and a plurality of nodes, comprising: (a) computing a productivity capacity factor for the first node, the productivity capacity factor being a measure of the processing power of the first node; (b) determining a nodal productivity capacity factor for each node, the nodal productivity capacity being a measure of the processing power of the node and the communication cost of exchanging information between the first node and the node; (c) selecting a combination of nodes having an effective nodal productivity capacity factor less than the productivity capacity factor for the first node; (d) parsing the electronic document to undergo disposition into a first sub-document and sub-documents when a combination of nodes having an effective nodal productivity capacity factor less than the productivity capacity factor for the first node, the number of sub-documents being equal to a number of nodes in the combination of nodes having an effective nodal productivity capacity factor less than the productivity capacity factor for the first node; (e) processing the sub-documents at the nodes forming the combination of nodes having an effective nodal productivity capacity factor less than the productivity capacity factor for the first node; (f) aggregating a processed first document and each processed sub-document to create a processed electronic document to undergo disposition; and (g) rendering the processed electronic document to undergo disposition at the first node.
 14. The method as claimed in claim 13, wherein the sub-documents are processed at the nodes forming the combination of nodes having an effective nodal productivity capacity factor less than the productivity capacity factor for the first node.
 15. The method as claimed in claim 13, wherein a size of a sub-document is proportional to the processing power of the node which will process the sub-document.
 16. The method as claimed in claim 13, wherein the processing power of a node is the combined processing power of the node and direct children nodes connected thereto.
 17. The method as claimed in claim 16, further comprising: (h) parsing the sub-document, at the node having direct children nodes connected thereto, into a first sub-document and partial sub-documents.
 18. The method as claimed in claim 17, further comprising: (i) aggregating a processed first sub-document and each processed partial sub-document at the node having direct children nodes connected thereto, to create a processed sub-document.
 19. The method as claimed in claim 17, wherein the node having direct children nodes connected thereto determines a productivity capacity of each child node, the productivity capacity being a measure of the processing power of the child node and the communication cost of exchanging information between the node having direct children nodes connected thereto and the child node, the communication cost being based upon a bandwidth between the node having direct children nodes connected thereto and a child node.
 20. The method as claimed in claim 13, wherein the communication cost is based upon a bandwidth between the first node and a node. 